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Abstract 

The goal of this paper is to provide a cohesive description and a critical comparison 
of the main estimators proposed in the Uterature for spatial binary choice models. The 
properties of such estimators are investigated using a theoretical and simulation study. To 
the authors' knowledge, this is the first paper that provides a comprehensive Monte Carlo 
study of the estimators' properties. This simulation study shows that the Gibbs estimator 
( LeSage[ 2000| l performs best for low spatial autocorrelation, while the Recursive Impor- 



tance Sampler ( |Beron and Vijverberg[|2004) performs best for high spatial autocorrelation. 
The same results are obtained by increasing the sample size. Finally, the linearized Gen- 
eral Method of Moments estimator ( |Klier and McMillen) 2008| ) is the fastest algorithm that 



provides accurate estimates for low spatial autocorrelation and large sample size. 



1 Introduction 



In applied work in economics and political science, there is increased attention to the impor- 
tance of spatial or network interdependence between observations. Not only does this vio- 
late the assumption of independence underlying many econometric methodologies for cross- 
sectional data, there is also growing interest in estimating the strength of the interdependence 
itself. While the econometric literature on linear regression models with spatial interdepen- 
dence is well established, in particular since the publication of Anselin ( |1988[ )'s seminal work, 
the literature on regression models with binary dependent variables and spatial interdependence 
is still relatively limited. 

Many applications with such models can be considered - including the contagion of cur- 
rency crises (Novo} 2003), firm-level decision-making on locations (Autant-Bemard. [2006| ), 



ecological studies of spatial distributions of plants ( [CoUingham et al4|2000| ), studies in policy 
diffusions of flat taxes (Baturo and Gray[[2009| ), anti-smoking laws ( Shipan and Volden] |2006| ) 
or pension privatization ( Weyland j | 2007[ ) - across academic disciplines such as economics, 
political science, sociology, ecology, planning, or even neurology. 



*Corresponding author. 
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Proximity in this context can be interpreted in a broad manner. Whether one defines prox- 
imity in a physical or in a cultural or interaction sense, or in a manner that encompasses large 
distances or the entire space (all units affect all other units), the estimation challenges discussed 
in this article still holdT] The conclusions can thus be directly applied to social network anal- 
ysis as well as spatial econometric analysis. Anselin ( |2002 , 255) refers to this perspective as 
the object view or this type of data as lattice data. The alternative, a geostatistics perspective, 
where we observe only specific monitoring sites and space is seen as a continuous space or a 
point pattern ( Bivand[[1998| ), leads to an entirely different econometric framework and will not 
be discussed in his article. 

Spatial econometric models raise new difficulties that cannot be dealt with by standard 
econometric models. Estimation problems arise due to the dependence across observations, 
in that we must adjust the estimation procedures for the loss of information associated with 
dependent observations. Indeed, in the presence of spatial dependence, standard logit or probit 
estimation procedures, which assume independence, result in inconsistent and inefficient esti- 
mates ( |McMillenl|1992| ). In particular, |McMillen| ( | 1 992) notes that both the spatially dependent 
error model and the spatial lag model imply heteroskedastic disturbances, which cause the pa- 
rameter estimates to be inconsistent. For these reasons econometricians began to pay more 
attention to spatial dependence problems in the last two decades and some important advances 



have been made in both theoretical and empirical studies ( [Anselin, Florax and Rey[|2004| ). 

The aim of this article is to compare the main estimators proposed in the literature for 
estimating the spatial autocorrelation parameter in binary choice models. On the one hand, this 
goal is achieved by analysing the theoretical characteristics of the main estimators for spatial 



models for binary response data. This topic has been in part developed by |Fleming| ( |2004[ ) but 
we consider also the recent literature. Moreover, our paper is focused only on binary choice 



models, instead ,Fleming| p004| ) has considered discrete choice models. On the other hand, the 
most innovative aspect of this work is the comparison of the above-mentioned estimators by 
Monte Carlo simulations. To our knowledge, this is the first work that performs Monte Carlo 
simulations on the main estimators of the spatial autocorrelation parameter for binary response 
data. The importance and the necessity of this analysis is strongly suggested by Fleming| ( [2004| ). 

Currently, the most used methodologies available to estimating spatial regression models 
are five. McMillen (1992]) proposes an EM algorithm based estimation procedure. In particu- 
lar, [McMillen ( 1992) replaces the latent continuous variable with its expected value and then 
applies the maximum likelihood method (Ord 1975[ ). Similarly to McMillen] ( 1992| ), LeSage| 
(|2000) also replaces the latent continuous variable with its expected value, solving thereafter a 



spatial continuous model using the Gibbs sampling approach. Following the work of Vijverberg 



( [1997 ) on the simulation from a multivariate normal distribution, Beron and Vijverberg ( 2004[ ) 
suggests to apply the recursive importance sampling (RIS) to the maximum likelihood method, 
since the likelihood function is a multivariate normal distribution. ' Pinkse and Slade (1998) 
develop a model based on the generalised method of moments (GMM). Klier and McMillen] 
( |2008 ) linearize .Pinkse and Slade (JT998 )'s model around a convenient starting point. 

The present paper is organized as follows. The next section reviews the widely used spec- 
ifications of the binary choice models with spatial dependence. In section 3 we analyse and 
compare the main methodologies proposed in the literature to estimate the spatial autocorre- 



'The complications with the interpretations of the observed effects increase, however, since factors that affect 
the similarity are also likely to have an effect on the linkages between the units. See Shalizi and Thomas ( 2010| l 
for a discussion of the inherent confounding of homophily and contagion mechanisms. 
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lation parameter in binary response models. In section 4 we compare the properties of these 
estimators by Monte Carlo simulations. The last section concludes. 



2 Spatial binary choice models 



A widely used representation of a regression model for an observed dichotomous response 7, 



is the latent response model (Verbeek, 2008 p. 180) with dependent variable the continuous 
variable Y-*, whereby 

' 1, Y*>0 
0, otherwise, 

with z = 1 , 2, . . . , n. A linear model is specified for this latent response, so the model specifica- 
tion is 



(1) 



pWY*+X/3 + d 
ASd + e, 



(2) 



where Y* is a continuous random vector, X represents annxk matrix of explanatory variables, 
the error term e can follow a multivariate normal distribution in a probit model or a multivariate 
logistic distribution in a logit model. W and S are spatial lag and spatial error weights matrices, 
respectively, p and A the associated scalar parameters. We highlight that only the latent variable 
can be used for the spatial lag, since both the models Y* = pWY + X/3 + e and Y = pWY + 
X/3 + e are infeasible (jAnsehnj [2002} [Beron and Vijverberg[ [2004t |Klier and McMillenj [2008| ). 

Evidence of the absence of a consolidated literature is given by the different denominations 
of the models - we follow [LeSage ( |2000 )'s notation. From the general model (|2]) two models 
are derived. Setting S = produces a spatial lag model, which we will refer to as the Binary 
Spatial AutoRegressive model (BSAR): 



where 



Y* = (/-pW)"i(X/3 + e) = {l-pW)-^X/3 + e, 



e = (I-pW)"^e. 



(3) 



(4) 



Letting W = results in a regression model with spatial autocorrelation in the disturbances, a 
spatial error model which we label the Binary Spatial Error Model (BSEM): 



Y* =X/3 + (I-AS)"'e = X/3 + u 



where 



u 



e. 



The two models are based on different assumptions about the causes of the spatial depen- 
dence!^ The spatial lag relates to an explicit spillover effect where one agent copies behavior 
from neighboring agents. It also relates to a theoretical model where the behavior is dependent 
on shared resources between different agents. The spatial error model concerns different causal 
relationships. For example, a typical issue that leads to spatial correlation in the errors is a 
mismatch between the spatial delineation of the measurement and the empirical presence of the 



For a clear interpretation of the spatial lag and spatial error models, see Case ( 19921. 
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variable of interest. For example, when studying the presence of a particular natural resource 
in particular countries, the geographical zones in which this resource is present do not usually 
match exactly with the country borders. A measurement of the presence of these resources in 
countries is thus necessarily spatially correlated, but as a nuisance rather than in a theoretically 
interesting sense. Another common cause of spatial autocorrelation in the errors is an omitted 
variable that is itself spatially correlated. In terms of estimation, the two types of autocorre- 
lation are often difficult to distinguish ( [Brueckner , 2003 [ 184-185). The different theoretical 
mechanisms are of course not mutually exclusive and a spatial model that incorporates both a 
spatial lag and spatial residuals is perfectly reasonable. 

In this paper we are primarily interested in estimating diffusion effects, and thus our focus 
is on the estimation of the spatial autocorrelation parameter p. For this reason, in this work we 
only analyse models with spatial lags (BSAR) and leave spatial errors (BSEM) aside. 

The contiguity or weight matrix W is defined by 



Wij 



1 if the /-th and 7-th observations are contiguous; 

if z = J or the z-th and 7-th observations are not continguous. 



so it is a square matrix of order n and its main diagonal elements equal to zero. Contiguity 
can refer to geographical and alternative vicinity. The use of the weight matrix W implies that 
the spatial sites form a countable lattice ( [Lee , 2004[ ), but part of the literature considers a con- 
tinuous spatial index ( |Conley[ 1999| ). Because of the potential of heteroscedasticity due to the 
variation in the number of neighbors for different observations, W is commonly normalized as 
follows Wij/ {Y^jWij) for z, J = 1,2, ... ,n. This means that the normalized matrix W is gener- 



ally asymmetric, while the original weight matrix W is often symmetric |j Although this is the 
common approach, there are various other ways of defining and normalizing W ( ,Tiefelsdorf[ 
2000[ [Anselin . 2002[ ). Since the aim of this paper is the comparison of the main methodologies 
to estimate the spatial autocorrelation parameter p, we consider for all of them the normalized 
matrix w|l] 

For binary dependent variables, the most used models are the logistic and the probit models 
( [McCuUagh and Nelder[fl989j . In the next section we analyse and compare the main estimators 
of the autocorrelation p arameter in both spatial probit (|Beron and Vijyerberg[ |2004[ |LeSage[ 
2000[ |McMillen[ |l992| ) and logit ( jKlier and McMillenj |2008| ) modelsf] 



Novo 



( 2003 I considers an asymmetric non-normalized W. 



When the normahzed contiguity matrix W is considered, to ensure the invertibility of the matrix (/ — p W) 
in the maximum HkeHhood method, Anselin (1982i proves that l/cOmi„ < p < 1 where a),„,„ is the minimum 
eigenvalue of the contiguity matrix W. 

^There are a number of related estimators that, for various reasons, will not be included in the discussion 
and Monte Carlo analyses. These estimators are related, but make assumptions about the data that are beyond 



the scope of this paper For the spatial random effects probit (Case 1991 1992 1, when W is constrained to be 



block-diagonal, in other words, when the focus is on membership of a particular geographic region or cluster 
of units rather than some kind of proximity measure, the spatial model can be substantially simplified ( jCase 



1991[[T992) l. The logistic auto-logistic ( [Gumpertz, Graham and Ristaino[ [r997[ |Bee and Espa| [2008 ) applies to 
data on a regular grid, which is not apphcable in the type of diffusion studies we have in mind in this paper 
Dubin| ( [T995| l's spatial logit model is a straightforward diffusion model that avoids most complications of spatial 



models by using the temporally lagged, realized dependent variable to create the spatial lag. McMillen (1992 



1995Z?| l's heteroscedastic probit using weighted least squares applies to the spatial error model, but not the spatial 
autoregressive model we discuss in this paper. 
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3 Estimators for binary spatial autoregressive models 



3.1 Expectation-Maximization algorithm 



The Expectation-Maximization (EM) algorithm is designed for cases where the data is incom- 



plete, for example due to missing values (Dempster, Laird and Rubin 1977). Since the probit 
model can be viewed as a latent response model, and this latent variable is similarly unobserved. 



McMillen ( 1992 1 proposes to apply the EM algorithm to the probit model with spatially lagged 
dependent variables and spatial error autocorrelation. In particular, the latent unobserved obser- 
vations y* are replaced by estimated values. Given estimates of the values y*, the EM algorithm 
proceeds to estimate the other parameters in the model using the maximum likelihood method. 

In the EM algorithm the assumption of homogeneity for the disturbances e is introduced. 
This means that the error term e can follow the n-dimensional multivariate normal distribution 
e ~ Nn{0, CTgl) in a probit model. The variance of the error term is indeed 



var{e)=var[{I-pW)-'^e\ = [{1 - pW)' {1 - pW 



Let 



D 



(5) 



(6) 



: diag{(Te) 

be the diagonal matrix with diagonal elements erg that represent the root square of the diagonal 
elements in the matrix ([5]) and 



q = B-\l-pWy^Xl3. 



(7) 



Since /3 and cannot both be estimated in probit models, McMillen (1992) assumes 



ag = 1 . In the E-step, the observed dependent variable is replaced by the expectation of the 
latent variable Y* conditional on the observed dependent variable Y, making use of generalized 
residuals (Cox and Snell 1968, Chesher and Irish[ 1987| ). To compute this expectation in the 
first iteration, the starting values of the parameters f3 and p are used, in subsequent iterations, 
the estimated parameters. By computing the conditional expectation of equation (|3]), in the 
E-step the following result is used 



E[Y*/Y = y] = {l-pWy^X/S + D 



4>„(q)[l-4>„(q)] 



(8) 



where ^„(-) and ^n{-) denote respectively the n-dimensional multivariate probability density 
and cumulative distribution functions of a standard normal. 

Subsequently, setting = 1 in the M-step, new estimates are obtained by maximizing the 
log-likelihood function 

k-\[{l- pW)y* - X/3]'[(I - p W)y* - X/3] + £ Zn( 1 - pco,) , 
where £0/ are the eigenvalues of W. HLi ( 1 — p w, ) is a computationally efficient approximation 



of the determinant |I — pW| ^Ord 1975 1. This process is repeated until convergence]^ 



^To obtain a p estimate in the interval (-1,1), we apply the one-to-one transformation p = — 1 -I- 2<I>i (p*), 



making use of the invariance of maximum likelihood estimators (Davidson and MacKinnon 1993 p. 253-255). 
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The main advantage of this methodology is that it avoids to compute an « -dimensional in- 
tegral. The cost is that the E-step requires the calculation of the inverse of the matrix (I — p W) . 
Although this can be made slightly more efficient by using the eigenvalues of W to approxi- 
mate the inverse, it still slows down the algorithm considerably. In addition to the computa- 
tional burden in the implementation of the algorithm, the main drawback of this proposal is the 
covariance matrix estimate of dimensions nx n. By considering the spatial probit model as a 
non-linear weighted least squares model, McMillen (1992]) obtains biased but consistent esti- 
mates of the covariance matrix. For this reason McMillen] ( | 1995a I explores computationally 
simpler alternatives to the methods in McMillen ( 1992), expressing the belief that the methods 
proposed in McMillen (1992) are impractical for large sample sizes. Another problem with 
McMillen ( 1992[ )'s approach is the need to specify a functional form for the nonconstant vari- 



ance over space ( [LeSage . 2000[ ). In larger models a practitioner would need to devote consid- 
erable effort to testing the functional form and variables involved in the model for the variance 
of the noise elements £, . Finally, the EM approach cannot provide an estimate of precision for 
the spatial autoregressive parameter p . 



3.2 Gibbs sampling 



The Gibbs sampler is a particular Markov Chain Monte Carlo (MCMC) introduced by Geman 



and Geman ( 1984| ) in the context of image restoration. When a direct specification of a joint 



distribution is not feasible, the Gibbs sampling procedure specifies the complete conditional 
distributions for all parameters in the model and proceeds to sample from these distributions 
to collect a large set of parameter draws. During sampling, a conditional distribution for the 
latent observations y* conditional on all other parameters in the model is consideredj^ This 
distribution is used to produce a random draw for all y* in the probit model. The conditional 
distribution for the latent variables takes the form of a normal distribution centered on the 
predicted value truncated at the left at if yt = 1 and at the right at 1 if yt = 0. 

The Bayesian Gibbs sampler approach to estimating spatial discrete choice models (both 
BSAR and BSEM models) is proposed by |LeSa ge' (2000) and is an extension of the Gibbs 
sampling method suggested by Geman and Geman (198 4 )p| This method exhibits a similarity 



to the EM algorithm, where the latent unobserved observations on the dependent variable yj are 
replaced by estimated values. The Bayesian approach is different in the way it formulates the 
likelihood function and the estimates of the unobserved latent variable. The Gibbs estimator 



remedies the two limitations of McMillen] ( |T992) 's EM estimator, its slow convergence and its 
bias in the estimation of standard errors. 

It is important to underline that [LeSage (2000) relaxes the assumption of homogeneity 
for the disturbances e used in BSAR and BSEM models. This means that the error term e 
can follow a multivariate normal distribution e ~ A^„(0, CgV) in a probit model, where V = 
diag{vi,V2, ■ ■ ■ ,Vn) and v, with / = 1,2, are the variance parameters to be estimated. 
|Greene| (2003 ) points out that accounting for heteroskedasticity is important for probit mod- 
els because the estimates are inconsistent in the presence of nonconstant disturbance variances. 

In order to assign the priors of a BSAR model, LeSage] (2000) assumes that the priors are 



Gelfand and Smith 



( 1990|l demonstrate that Gibbs sampling from the sequence of complete conditional distri- 



butions for all parameters in the model produces a set of estimates that converges in the limit to the true posterior 

dist ribution of the parameters. 

* Bolduc, Fortin and Gordon ( 1997 1 take a similar approach for the closely related spatial ordinal probit model. 
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independent 
where 



;r(p,/3,a,V) = n{p)n{(3)7t{a)7t{\), 
7t(p) cc constant 



n{vi'/q) 



1,2,, 



Tt(<J 



oc 



1 

o' 



The parameter q controls the amount of dispersion in v,, with small values of q pr oducing 
leptok urtic distributions and large values imposing homoskedasticity]^ We summarize LeSage 
(2000)'s algorithm by the following steps 
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1. Initial values for the parameters po, I3q, Oq, Vq are considered. The residuals eo are 
computed by substituting these values in equation Using a random draw from X^i^) 
the following value is determined: 



2. Given the values po, Vq, Ci, the parameter /3i is drawn from the multivariate normal 

/(/3i/po,Vo,cTi) ^Nn [(X% iX)-iX% i(I-pW)y*,ai2(X% iX)-i] . 

3. By drawing an n-vector of random X^{(1~^ 1) ^^id by using po, /3i, and 0\, the values v,- 
with / = 1 , 2, . . . , n are computed with 



Vi = 



^1 



4. By knowing the values (3i, 0\,\\, the metropolis sampling algorithm ( |Hastings[|1970| ) 
is applied to determine pi. The conditional posterior for p given /3i, 0\, \\ is 



/(po//3i,(Ti,Vi)oc|I-pW|exp|-^e'iVr^ei|. 



(9) 



Let the value p* = po + cZ be generated, where Z is a draw from a standard normal 
distribution and c is a known constant^ ^ The acceptance probability p = min{\, ^j^^}^ 
where /(•) is defined in equation ([9]). A value m is drawn from a continuous uniform 
distribution with support [0, 1]. If m < the next draw from the density function ^ is 
given by pi = p*, otherwise the draw is taken to be the current value pi = po- 



= 7 produces estimates similar to logit and use of a large value, e.g. q=100, produces estimates similar to 
those from probit. 

'''We foll ow [Thomas (2007 1's implementation of LeSage (2000 1's methodology, which follows the suggestion 
by Fleming ( [2004 p. 159) to transform the latent variable into one that is distibuted independently by using the 
Cholesky root of the inverted error covariance matrices (cf. matrix A in the RIS estimator below). 

'^For the Monte Carlo simulations in this article we set c = 0.1. 
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5 . The values of the latent dependent variable y* are sampled from the multivariate truncated 
normal distribution 

Y*~A^J((I-piW)-iX/3i,A), 

where A is the diagonal matrix whose elements are the elements of the main diagonal of 
the matrix (I — piW)^^ee'[(I — piW)^^]'. The normal distribution is truncated at the left 
at if F = 1 and truncated at the right at if F = ([Albert and Chib[|1993"|). 



LeSage (2000 1 's approach overcomes the problems in estimating the standard error in the 



EM algorithm since parameter standard errors are derived from the posterior parameter dis- 
tributions. The first advantage of the Bayesian strategy is to be able to derive the condition 
distribution of each parameter, and thus compute different moments of the distribution. The 
second advantage is its flexibility to account for the heteroskedasticity in the error terms. 



3.3 Recursive Importance Sampling 



Beron and Vijverberg ( 2004| ) propose a recursive importance sampling (RIS) estimator to eval- 
uate directly the n-dimensional integral in both the BSAR and the BSEM models. The RIS- 
normal simulator is identical to what is sometimes called the Geweke-Hajivassiliou-Keane 



(GHK) simulator ( [Borsch-Supan and Hajivassiliou[ |1993| ) 
Define Z as an n x n matrix with 



1 




if i = j 
otherwise. 



for i,j= 1 , 2, . . . , n. This means that Z is a diagonal matrix that satisfies the equation ZZ' = I„. 
By defining t = Ze, we obtain that 

yar(t) =Zyar(e)Z' = Sp, 

where Var{e) is provided by equation (|5|. By this notation the observed vector y defined in ([T]) 
leads to an upper limit on t: 

t< -Z(I-pW)"iX/3, 
which means that we can write the log-likelihood function as 

lnL = ln4>4-Z(I-pW)-iX/3;0,Sp] = ln$4T;0, Sp] , (10) 

where 4>„ [j ; , fl] is a n-dimensional normal cumulative distribution function with mean vector 
jU and variance-covariance matrix Q. 

In order to evaluate the probability in equation ([10]), Beron and Vijverberg| (2004 ) propose to 
apply the RIS simulator, developed in detail by | Vijverberg ( |1997| ). Let A be an upper triangular 
matrix such that A'A = Ep ' and let rj = At. Whether Ep is standardized or not, the vector rj 
is i.i.d. standard normal. By defining the matrix B = A^^, B results an upper triangular matrix 



with bjj > Vj and Bt] = t. 



Given the upper bound {Br] = t} < T, we can apply the following iterative procedure: 

Vn < b„,lTn = rinO 



< ^1 



(11) 
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Let gi^ij) be a probability density function with support the whole real axis and let G(-) be the 
associated cumulative distribution function. By denoting 

for rij < rjjQ, we can compute the following probability: 



/■T rri„o /-TJi.o " 

p = P{t<T}= ^{t;0,fl)dt= ... Yl^{rij)drii...dri„ 

J — oo J OO J OO j I 



^2,0 ,^(772 



^{riio)g'{ri2)dri: 



g''{rin)drin. 



(12) 



The RIS simulator is implemented by drawing a large number R of random vector rj sat- 
isfying the condition rjj < rjjo for j = 1, 2, . . . ,n from the density function g{-). 



12 



There are 



different suitable density functions used to define g{-) ( jVijverberg . 1997). Vijverberg (1999) 
shows that the RIS-normal simulator is often preferred. For this reason we choose the normal 
density function in the following Monte Carlo simulations and, in particular, we apply the an- 
tithetical sampling strategy suggested by [Vijverberg ( 1997 1 for simulating from a multivariate 
normal distribution. 

The recursive nature of the RIS simulator is due to the fact that the bounds in equation 



(11) are backwards determined. For every drawing r of the random vector ry, given T]„o, the 
values ffn r and f]„_i o,r are calculated using equation ( 1 1 1 by using f]„^ in the place of T]„. This 
process is repeating until f7i,o,r is computed. Then for the RIS-normal simulator the simulated 



value for p, defined in equation ( 12), is 



= 1 



where ^»(-) is the cumulative distribution function of the one dimensional standard normal 
random variable. 

Based on the Monte Carlo study that Beron and Vijverberg ( |2004| ) performed the RIS simu- 
lator can provide accurate estimates for spatial binary choice models. Moreover, this approach 
is attractive since it is the only one that directly evaluates the n-dimensional probit likelihood 
function. This means that only this methodology allows for the use of the Likelihood Ratio test. 
Because of these advantages |Beron, Murdoch and Vijverberg| ( [2003 [ ) and |Novo| ( |2003| ) apply 
the RIS simulator. 



3.4 Generalized Method of Moments 

This section describes a spatially dependent binary choice methodology that considers the prob 
lem as a weighted non-linear version of the linear probability ( Amemiyal|1985 ; Greene[ 2002 



Maddala 



1983| ) with a variance-covariance matrix that can be estimated with a Generalized 

Pinkse and Slade' ( |1998| ) derive the 



Method of Moments (GMM) estimator (Hansen 1982) 



GMM moment equations from the likelihood function. [Klier and McMiUenj ( |2008| ) propose a 
linearized version of the GMM suggested by Pinkse and Slade, ( ,1998| ). 



12 



For the Monte Carlo simulations we use R = 1000. 
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3.4.1 Pinkse and Slade's estimator 



While [Pinkse and Sla"del(|1998|) suggest to apply the GMM to a BSEM model, for achieving the 



aim of this article we pr esent their estimator for a BSAR model. Similar to McMillen ( 1992 ), 



Pinkse and Slade| ( |1998[ ) consider the generalized residual^ 
e{e)=D-'E[e/y,e[ 



(13) 



4>„[q(0)]{l-4>„[q(0)]}' 

where 6 = {f3',py is the parameter vector and D and q are defined in equations ^ and (jv]), 
respectively. 

By applying the GMM the parameter vector is estimated by 



e 



argmin e'(0)ZMZ'e(0), 
6>e© 



(14) 



where e is defined in equation (13), Z is a matrix of instruments^] M is a positive definite 
matrix and is the parametric space. 



Pinkse and Slade (1998) provide the asymptotic variance of their estimator for a BSEM 



model and develop also the hypothesis test for spatial error correlation. Their approach over- 
comes the problems of evaluating a high order integral and the n by n determinants in the 
Maximum Likelihood method. The main disadvantage of this approach is t hat it requires the 
nx n matrix (I — pW)^^ to be inverted in each iteration. Furthermore, since 



Pinkse and Slade 



( 1998 ) apply the GMM method, their estimator is less efficient than the ML estimators. 



3.4.2 Klier and McMillen's estimator 



Klier and McMillen (2008) linearize Pinkse and Slade (1998)'s model around a convenient 



starting point for a BSAR logit model. In particular, in equation (14), Klier and McMillen 
(2008) let M = (Z'Z)"^ so the objective function for the GMM estimator is 

= argmin e'(e)Z(Z'Z)-^Z'e(e), 
ee® 

hence 'Klier and McMillen' (2008) apply a nonlinear two-stage least squares method. In order 
to analyse |Klier and McMillen,(2008) 's methodology we define 



F = P{Y =1/0} 



exp[q(g)] 
l+exp[q(0)]" 



(15) 



where q{6) is defined in equation 

[Klier and McMillen (2008|)'s iterative procedure has the following steps: 

1 . assume initial values for the parameter vector = (/^q, Po)'; 

2. compute eo defined in equation (|4]); 



^^Unlike [Chesher and Irish| ( |l987| and jCox and Snell| ( |l968| l (see also eq. (fs])), [Pinkse and Slade| ( fT998| define 
the generalized residuals as D^^£'[e/y,e,p] and not E[e/'y,e,p]. 

' ^In the Monte Carlo simu lations we consider Z = 1 + X + WX + W^X + W^X. 
Pinkse and Slade ( 1998 1 consider M equal to the identity matrix M = I„ in their empirical appUcation and we 
follow this suggestion in the Monte Carlo simulations. 
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3. compute the gradient terms 



Gpi 



dp 



Pi{\-Pr) I 



(16) 



where t, is the ?-th row vector of the matrix T = D^^ (I — pW)^^X, hi is the ?-th element of 
the vector h = (I — pW)^^Wq, qi is the ?'-th element of the vector q defined in equation 
^ and T,, is the i-th element of the diagonal of the matrix T = (I — pW)^^W(I 

pw)-i(i-pw)-ip 
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4. regress the gradient terms and Gp on Z and compute the predicted values and 



5. regress eo + G^/3o on G^ and Gp. The coefficients obtained from this regression are the 
estimated values of /3 and p. 

The main advantage of this approach is that it is not iterativ e and does not require th e 

1 



pW) to be inverted in each iteration, unlike 



Pinkse and Slade 



(1998l's 



nx n matrix (I 

estimator. This characteristic leads to a computationally significantly faster estimator. The 
main disadvantage of this estimator is that it provides accurate estimates of p only as long 
as p is small. Furthermore, linearized approach cannot provide an estimate of precision for 
the spatial autoregressive parameter p. Finally, since Klier and McMillen (2008) propose a 
linearization around the starting point, a restriction for the parameter p to the interval [-1,1] 
cannot be introduced by using their method. 



4 Monte Carlo simulations 



In order to make up for the lack of simulation studies for BSAR models ( Fleming! 2004), in 
this section we compare the properties of these five estimators by Monte Carlo simulations. 
The set up of the simulations is primarily based on the hterature on policy and regime diffusion 
(e.g. Gleditsch and Ward 2006[ Baturo and Gray, 2009) and on broad similarity with simu- 
lations as published in accompaniment of the proposals of estimators discussed in this paper 
(e.g. |McMillen[ |1995Z?[ |Beron and Vijverberg"! , |2004[ [Klier and McMillen[ |2008[ ). In order to 
understand how the properties of these estimators vary according to the number of observa- 
tions, we consider two different sample sizes: n = 50 and n = 500. The first sample size is 
set because it resembles the number of states in the US, which is a typical application area for 
studies in policy diffusion (e.g., Mooney[ 2001 ; Volden 2006 [LeSage and Parent 2007). The 
larger sample size is added to be able to see the results when the sample size increases. 



'^The derivative in equation ( 16 1 assumes that (I — p W) is a symmetric matrix, which is not guaranteed. For 
example, when W is standardized, this is not the case. See the appendix for this derivative without assuming a 
symmetric matrix. The revised derivative in the appendix does not affect the estimator or the Monte Carlo results 



when the convenient starting point at p = is used, as in jKlier and McMillen ( j2008^ 
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In our Monte Carlo analysis we generate 1,000 replications!^ For generating the data set: 
we consider one covariate X drawn from a normal distribution N{2,4) with expected value 2 
and standard deviation 4 Based on equation (js]), the residuals vector e is generated from 
a multivariate normal distribution A^„(0,I) and the parameter vector is /3 = [4, —2]'. In order 



to generate W, we apply the method suggested by Beron and Vijverberg ( 2004[ p. 179) by 
using d = 0.21 for n = 50 and d = 0.06 for n = 500. To analyse how the characteristics of the 
estimators change according to the level of autocorrelation, we consider four different values 
(0;0.1;0.45;0.8) of the parameter p, such that the last three values are equidistant^*^ For the 
maximization procedure in the EM, RIS, and Pinkse and Slade| ( |199 8) estimators, we use the 
optim ( ) function in R with a maximum number of iterations of 1,000. Finally, analogously 
to |LeSage] (|2000[) and|Beron and Vijverberg| ([2004[) we consider a maximum number of loops 



equal to 1,000 for the RIS and EM algorithms, and 3,000 for the Gibbs estimator. 

In the following tables we report the mean of the bias and the standard deviation of the 
estimators (in round brackets) computed on 1,000 sets. The data is generated based on a probit 



model. Since Klier and McMillen (2008 ) have proposed their estimator for the logit model, we 
rescale their parameter estimates to allow for comparison. Because the variance of the logistic 
distribution is 7r^/3, we report the estimates Poy/S/n and j8iV^/;r for the linearized GMM 
estimator" 
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The primary focus of this paper is on the estimation of the level of spatial autocorrelation 
in a binary spatial autoregressive model specification. Since we have the study of the diffusion 
of policies and regimes in mind, the level of diffusion is typically of key interest. The auto- 
correlation in the residuals is thus not treated as a mere nuisance, but as a structural factor of 
substantive interest. Figure [Tjprovides the distribution of the bias ofthe estimators of the spatial 
autocorrelation parameter p in the BSAR model described in equation ([3 Table [Tjprovides 
the mean and standard deviation of the bias of the above-mentioned estimators.. 

It is clear from Figure[T]that the performance of the estimators varies depending on the level 
of autocorrelation in the data, with particularly large differences between estimators under high 
levels of autocorrelation. As can be seen in Table [T] in the absence of spatial autocorrelation 
(p = 0), the Gibbs and the EM estimators are the best estimators of p in terms of both the 
distortion and the dispersion. The linearized GMM estimator also does particularly well, which 
is unsurprising, since p = is the value used as the starting point for the linearization. When 
looking at the distribution as a whole, however, it is clear that while this estimator performs 
generally well, there are clear outliers among the estimates. The RIS estimator shows the 
worst performance and it tends to underestimate p for both small (n — 50) and large {n = 500) 
samples, with in particular some negative outliers. The Pinkse and Slade| ( |1998 ) estimator, on 
the other hand, tends to overestimate p in this scenario. 

When the level of spatial autocorrelation is positive but still limited (p = 0.1), the EM and 
Gibbs estimators still show good performance, even if the differences with the other estimators 



I7n 



The same number of replications is considered by Flores-Lagunes and Schnier ( 2005 1; Franzese and Hays 



(2007 1; Klier and McMiUen (2008 1. 

^^We use th e R package r lecu yer for the parallel generation of random numbers. 



'^Following 



McMillen 



( 1995/? I, we prefer to co nsi der a standard deviation of X substantially higher than o^. 



Klier and McMillen 



(2008 1 consider similar values. 



^ lAnsehn| ( 1982| , |Beron and Vij verberg' ('2004') and 
•^^The spatial coefficient p is not affected by the scaling. 
^^These plots are generated using the violin function in the lattice package in R, which in turn makes 
use of the built-in density function for the computation of smoothed kernel density estimates. Default settings 
are used. 



12 



EM 
Gibbs 

RIS 
GMM 

GMM (lin) 
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EM 
Gibbs 
RIS 
GMM 
GMM (lin) 



n = 50 
rho = 



0.45 



0.32 



1.00 



0.75 



0.71 



1 



0.32 



0.32 



0.55 



0.86 



1.11 



n = 500 



rho = 



0.10 



0.10 



0.53 



0.25 



0.15 



f 
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I 



0.12 



0.11 



0.14 



0.18 
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n = 50 
rho = 0.1 



0.47 



0.42 



1.10 



1.00 



1.04 



0.25 



0.28 



0.47 



0.82 



0.98 



n = 500 



rho = 0.1 



0.14 



0.14 



0.46 



0.95 



0.17 
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0.07 



0.07 



0.13 



0.20 



0.14 



n = 50 
rho = 0.45 



0.57 



0.56 




1.45 



1.43 



0.73 



0.06 



0.13 



0.33 



0.38 



1.42 



n = 500 



rho = 0.45 



0.19 



0.20 



0.12 



1.32 



0.13 



-0.04 



0.02 



0.13 



0.11 



0.39 



n = 50 
rho = 0.8 




0.89 



2.19 



n = 500 



rho = 0.8 



0.35 



0.26 



0.25 



1.80 



0.24 



-0.20 



-0.03 



0.09 



0.20 



1.05 



-10 12 -10 12 -10 12 -10 12 

Bias 



Figure 1 : The distribution of the bias of the estimators of the autocorrelation parameter p obtained from 
Monte Carlo simulations on 1,000 samples. Numbers represent minimum and maximum values of the 
bias. 

are less considerable in comparison with the scenario without spatial autocorrelation. Both 
the EM and the Gibbs estimators tend to underestimate the spatial autocorrelation parameter. 
As can be expected, the linearized GMM estimator is still performing well this close to the 
linearization point of p = 0. The main disadvantage of this estimator is that it is not possible to 
put a reasonable constraint on p, such that occasional estimates are obtained outside the [—1,1] 
interval, visible in Figure [TJ The RIS estimator is also prone to the occasional outlier in its 
estimate of the level of autocorrelation. It is striking that all estimators, with the exception of 
GMM with n = 50, tend to underestimate p, often by over 50% of the true parameter value, 
when the sample size is low. 

The next scenario contains significant levels of autocorrelation, with p = 0.45. Under this 
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p=0 


p = 0.1 


p = 0.45 


p =0.8 




50 500 


50 500 


50 500 


50 500 


EM 


-0.002 -0.002 
(0.108) (0.034) 


-0.038 -0.016 
(0.108) (0.030) 


-0.148 -0.104 
(0.107) (0.022) 


-0.311 -0.272 
(0.098) (0.023) 


Gibbs 


-0.017 -0.003 
(0.108) (0.032) 


-0.050 -0.023 
(0.109) (0.030) 


-0.114 -0.058 
(0.120) (0.028) 


-0.140 -0.104 
(0.117) (0.030) 


RIS 


-0.051 -0.002 
(0.235) (0.056) 


-0.044 -0.003 
(0.219) (0.053) 


-0.027 0.003 
(0.158) (0.034) 


-0.015 -0.005 
(0.093) (0.039) 


GMM 


0.019 -0.001 
(0.165) (0.055) 


-0.049 -0.001 
(0.173) (0.060) 


-0.243 -0.097 
(0.228) (0.245) 


-0.547 -0.775 
(0.336) (0.458) 


GMM (lin) 


0.011 -0.001 
(0.149) (0.043) 


-0.037 -0.001 
(0.166) (0.044) 


-0.075 0.080 
(0.232) (0.072) 


0.028 0.468 
(0.528) (0.205) 



Table 1: The mean bias and the standard deviation (between parentheses) of the estimators of the 
autocorrelation parameter p obtained from Monte Carlo simulations on 1,000 samples. 



level of autocorrelation, the RIS estimator starts to perform relatively well compared to the 
other estimators. While the absolute mean bias is the lowest, the variation is still relatively 
high, however, and the plot clearly shows the presence of some outliers. For the EM and the 
Gibbs estimators, the bias is significantly larger than for p = 0. 1 and for the linearized GMM, 
a clear increase in the dispersion is visible in Figure [1} although this is compensated by a 
reduction in extreme outliers. The distribution of the bias of the GMM estimator now shows a 
clear tendency to underestimate the amount of autocorrelation, with a tight distribution under 
n = 500, but with some outliers. 

For high spatial autocorrelation (p = 0.8), the RIS estimator shows the best performance. 
The linearized GMM now clearly suffers from the large distance from the starting point of the 
linearization - the extrapolation from p = to p = 0.8 leads to significant overestimation of 
the level of autocorrelation. The plot also shows that the estimator clearly suffers from the 
lack of a constraint on p. The GMM estimator shows rather significant underestimation of p, 
as well as a high level of variance. Furthermore, under this scenario, the simulations suggest 
asymptotically biased in mean results for the GMM and the linearized GMM estimators, with 



the mean bias increasing for the larger sample size Following the trend already visible when 
moderately increasing p, the EM estimator clearly shows greater mean bias under this scenario, 
while for Gibbs, the results are similar to p = 0.45. 

Even with primary focus on the level of autocorrelation, the estimates of the effects of 
other independent variables can of course not be ignored. Table |2] provides the mean and 
the standard deviation of the estimates of /3i, the parameter for independent variable X. The 
differences in estimate accuracy between the estimators vary more dramatically than for p, 
with the Gibbs estimator clearly outperforming all other estimators under all conditions and the 
estimator proposed by lPinkse and Sladel ( |1998| ) clearly providing the worst results for both the 



•^^When the observations are "strongly spatially dependent" (Pinkse and Slade 1998 p. 134, fn. 12), even the 
consistency is not guaranteed. 
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bias and the dispersion. 

Under absence of spatial autocorrelation (p = 0) and for n = 50, the Gibbs estimator gener- 
ates the least average bias on /3i, with RIS, linearized GMM, and EM also performing well as 
long as the sample size is sufficiently large. Under n = 50, the Gibbs estimator underestimates 
/3i by about 50% on average of the value of /3i and the other estimators well beyond that. The 
difference between the smaller and the larger sample sizes is more pronounced than for the 
estimates of p. The GMM estimator is the only estimator that is still significantly biased when 
the sample size is reasonably large. All estimators tend to underestimate /3i . 

Under limited autocorrelation (p =0.1), the order of accuracy of the estimators remains 
more or less the same, with still only Gibbs performing the best under the small sample size and 
performing the well under the large sample size. The RIS performs similar to EM, Gibbs and 
GMMlin for large sample size. The GMM estimator shows the worst performance. Increasing 
the autocorrelation to p = 0.45, the results for the Gibbs, the RIS and the GMM estimators 
are very similar to p = 0.1, but the EM and the linearized GMM estimators show lower mean 
biases under small sample size. 

For the estimation of p, we saw significant difference between the moderate and the high 
levels of autocorrelation - to what extent is this the case for the slope coefficients? Similar 
to the estimates of p, the GMM estimator of Pinkse and Slade (1998) starts to show very 
significant distortion and dispersion when the autocorrelation is high and, furthermore, shows a 
significant increase in the mean bias when the sample size increases. The linearized version of 
this estimator also reflects this increasing mean bias with sample size for high autocorrelation, 
but nevertheless performs remarkably well, with an average bias of approximately 50 to 75% of 
/3i , underestimating the magnitude of the negative effect of X. The Gibbs estimator outperforms 
all other estimators both with small and large sample sizes, followed by the RIS, the EM and 
the linearized GMM estimators as long as sample size is large. It should be noted that for all 
estimators, the bias is relatively large, which is an overestimate of the magnitude by slightly 
over 10%. 

Usually of least concern to applied researchers, but relevant for accurate prediction, is the 
estimate of the intercept of the model, in this case Pq. Table [3] provides the simulation results 
for this parameter of the model. Not unsurprisingly, the results are closely in line with those for 
/3i. Indeed, the mean of the bias of /3o is roughly the bias in Table |2] multiplied by a factor —2. 
Relative to the size of j8, the bias is thus the same on average. Similarly, the standard deviation 
of the bias is multiplied by a factor 2, with the exception of the GMM estimator under higher 
levels of autocorrelation, where the dispersion under larger sample size is similar to that for /3i . 
The same relative results for the different estimators are therefore obtained. 



5 Conclusion 

In this paper we provide a comprehensive overview of estimators for spatial autoregressive 
models with binary dependent variables. These models are of particular interest to various ap- 
plications in economics, political science, and related disciplines, where the outcome might be 
a policy, a decision, a transition, or otherwise binary outcome. Applications can also be imag- 
ined in the field of bioinformatics or neuroscience, although sample sizes tend to be magnitudes 
larger than those studied here. Many of these outcomes are interdependent through either spa- 
tial contiguity or any other form of proximity, including social networks or economic linkages. 
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and ignoring the inherent spatial structure of the data generates inconsistent and inefficient 
estimates ([McMillen] |1992[). Furthermore, in many applications the researcher is explicitly 



concerned with estimating the level of interdependence - it is indeed this concern that is of 
primary interest in our discussion and simulation study. 

This paper compares five estimators introduced to date for this specific type of model. An 
extensive simulation study compares the performance of these five estimators under conditions 
of relatively small sample sizes and varying levels of spatial autocorrelation. When taking both 
the estimation of the extend of spatial autocorrelation and the coefficients on the other explana- 
tory variables into account, the Gibbs estimator ^LeSage[ 2000, ) clearly outperforms the other 
estimators. When the sample size increases, the difference between the different estimators 
becomes smaller. When focusing specifically on the spatial autoregressive component alone, 
the Gibbs estimator (LeSage, 2000[ ) performs best for low spatial autocorrelation, while the 
Recursive Importance Sampler (Beron and Vijverberg[ 2004") performs best for high spatial 
autocorrelation. The linearized GMM estimator ( Klier and McMillen^ ,2008) is an interesting 
option when the sample size is large and the autocorrelation relatively low, in particular due to 
its high computational speed. 
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Appendix: Derivative for the linearization of the GMM BSAR 
model 

The gradient ( [T6] ) of interest to the linearization proposed in I Klier and McMillen| ( |2008| ) is the 
derivative of the logistic link function to the spatial autoregressive parameter p: 



dPj d I d 



1 -^exp 



where ^' = (I — pW) and q as in equation We derive the following gradient 



Oei dp 



where h is defined in equation ( 16 ) and 

dOn d 



dp dp 



2c7, 



If ^ is symmetric, equation ( 17) simplifies to: 

do.i 1 



dp 



(17) 



which leads to equation ( 16 1. 
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